Alternatives To Make Text Searchable
Let’s explore third-party search engines and the inverted index in this lesson.
There are some alternatives available if we don’t want to use the databases’ built-in options: we can adopt a third-party search engine, or we can roll our own database-independent solution. Both are discussed in detail in this lesson.
Third-party search engines
If we need to search text in a way that works the same way regardless of which database brand we use, we need a search engine that runs independently of the SQL database. This section briefly describes two such products, Sphinx Search and Apache Lucene.
Sphinx Search
Sphinx Search is an open-source search engine technology that integrates well with MySQL and PostgreSQL. As of the time this course was written, an unofficial patch existed for using Sphinx Search with the open-source Firebird database. Perhaps in the future, this search engine will support other databases.
Indexing and searching are fast in Sphinx Search, and it supports distributed queries as well. It’s a good choice for high-scale search applications whose data updates infrequently.
We can use Sphinx Search to index data stored in a MySQL database. By modifying a few fields in a configuration file, `sphinx.conf`, we can specify the database. We must also write an SQL query to fetch the data for building the index. The first column in this query is the integer primary key. We may declare some columns as attributes for restricting or sorting results. The remaining columns are those to be full-text indexed. Finally, another SQL query fetches a full row from the database, given a primary key value coded as `$id`.
```
source bugsrc
{
  type                 = mysql
  sql_user             = bugsuser
  sql_pass             = xyzzy
  sql_db               = bugsdatabase
  sql_query            = \
    SELECT bug_id, status, date_reported, summary, description \
    FROM Bugs
  sql_attr_timestamp   = date_reported
  sql_attr_str2ordinal = status
  sql_query_info       = SELECT * FROM Bugs WHERE bug_id = $id
}

index bugs
{
  source = bugsrc
  path   = /opt/local/var/db/sphinx/bugs
}
```
Once we declare this configuration in `sphinx.conf`, we can create the index at the shell with the `indexer` command:
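For example, assuming the configuration above is saved as `sphinx.conf` and using the index name `bugs` it declares, a typical invocation looks like this (paths and options vary by installation):

```
indexer -c sphinx.conf bugs
```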
We can search the index using the `search` command:
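A minimal example, searching the index for the word “crash” (the keyword here is just an illustration, and the exact options depend on your installed version):

```
search -c sphinx.conf -b "crash"
```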
Sphinx Search also has a daemon process and an API with which to invoke searches from popular scripting languages such as PHP, Perl, and Ruby. The major disadvantage of this software is that the index algorithm doesn’t support incremental updates efficiently. Using Sphinx Search over a data source that updates frequently requires some compromises. For example, we may need to split our searchable table into two tables: the first to store the majority of historical data that doesn’t change and the second to store a smaller set of current data that grows incrementally and must be reindexed. Then our application must search two Sphinx Search indexes.
Apache Lucene#
Lucene is a mature search engine for Java applications. Work-alike projects exist for other languages, including C++, C#, Perl, Python, Ruby, and PHP.
Lucene builds an index in its proprietary format for a collection of text documents. The Lucene index doesn’t stay in sync with the source data it indexes. If we insert, delete, or update rows in the database, we must apply matching changes to a Lucene index.
Using the Lucene search engine is a bit like using a car engine; we need quite a bit of supporting technology around it to make it useful. Lucene doesn’t read data collections from an SQL database directly. Instead, we have to write documents into the Lucene index ourselves. For example, we could run an SQL query, create one Lucene document for each row of the result, and save it to the Lucene index. We can use Lucene through its Java API.
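To make this concrete, here is a rough sketch (not code from this course) of indexing one `Bugs` row through the Lucene Java API; the index path, field names, and hard-coded row values are assumptions for illustration:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import java.nio.file.Paths;

public class BugIndexer {
  public static void main(String[] args) throws Exception {
    // Open (or create) a Lucene index on disk; the path is an assumption.
    try (IndexWriter writer = new IndexWriter(
        FSDirectory.open(Paths.get("/tmp/lucene-bugs")),
        new IndexWriterConfig(new StandardAnalyzer()))) {

      // In a real application these values would come from an SQL query
      // over the Bugs table; they are hard-coded here to keep the sketch runnable.
      Document doc = new Document();
      doc.add(new StoredField("bug_id", 1234L));
      doc.add(new TextField("summary", "crash when saving", Field.Store.YES));
      doc.add(new TextField("description", "The application crashes on save.", Field.Store.NO));
      writer.addDocument(doc);
    }
  }
}
```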
Fortunately, Apache also offers a complementary project called Solr. Solr is a server that provides a gateway to a Lucene index. We can add documents to Solr and submit search queries using a REST-like interface so we can use it from any programming language.
We can also direct Solr to connect to a database itself, run a query, and index the results using its `DataImportHandler` tool.
Roll your own
Let’s suppose we don’t want to use proprietary search features, nor install an independent search engine product, but we still need an efficient, database-independent way to make text searchable. In this section, we design what’s called an inverted index.
An inverted index is a list of all the words one might search for. In a many-to-many relationship, the index associates these words with the text entries that contain them. For example, a word like “crash” can appear in many bugs, and each bug may match many other keywords.
First, we define a table `Keywords` to list the keywords for which users will search, and an intersection table `BugsKeywords` to establish the many-to-many relationship:
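A minimal sketch of these two tables in MySQL syntax follows. The column names mirror the examples in this lesson, but the exact types and keyword length are assumptions, and the `Bugs` table with its `bug_id` primary key is taken as given:

```sql
CREATE TABLE Keywords (
  keyword_id  BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  keyword     VARCHAR(40) NOT NULL,
  UNIQUE KEY (keyword)              -- each keyword is stored only once
);

CREATE TABLE BugsKeywords (
  keyword_id  BIGINT UNSIGNED NOT NULL,
  bug_id      BIGINT UNSIGNED NOT NULL,
  PRIMARY KEY (keyword_id, bug_id),
  FOREIGN KEY (keyword_id) REFERENCES Keywords(keyword_id),
  FOREIGN KEY (bug_id)     REFERENCES Bugs(bug_id)
);
```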
Next, let’s add a row to `BugsKeywords` for every keyword that matches the description text for a given bug. We can use a substring-match query to determine these matches using `LIKE` or regular expressions. This is more costly than the naive searching method described in the Antipattern lesson, but we gain efficiency because we only need to perform the search once. If we save the result in the intersection table, all subsequent searches for the same keyword will be much faster.
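For instance, a population pass for a single keyword might look like the following sketch, which assumes the tables above and matches against the `description` column only:

```sql
-- Link every bug whose description mentions 'crash' to that keyword
INSERT INTO BugsKeywords (keyword_id, bug_id)
  SELECT k.keyword_id, b.bug_id
  FROM Keywords k
  JOIN Bugs b ON b.description LIKE CONCAT('%', k.keyword, '%')
  WHERE k.keyword = 'crash';
```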
Next, we write a stored procedure to make it easier to search for a given keyword. If the word has already been searched, the query is faster because the rows in `BugsKeywords` list all the documents that contain the keyword. If no one has searched for the given keyword before, we have to collect the matching text entries the hard way.
Let’s try the following steps to write the procedure in MySQL (a sketch of the full procedure follows the list):

1. Search for the user-specified keyword. The code returns either the integer primary key from `Keywords.keyword_id` or null if the word has not been seen previously.
2. If the word was not found, insert it as a new word.
3. Query for the primary key value generated in `Keywords`.
4. Populate the intersection table by searching `Bugs` for rows containing the new keyword.
5. Finally, query the full rows from `Bugs` that match the `keyword_id`, whether the keyword was found or had to be inserted as a new entry.
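Here is one way the procedure could look; this is a sketch built on the tables defined above, not the only possible implementation. `MAX()` is used in step 1 so the lookup always returns exactly one row:

```sql
DELIMITER //

CREATE PROCEDURE BugsSearch(IN p_keyword VARCHAR(40))
BEGIN
  DECLARE v_keyword_id BIGINT UNSIGNED;

  -- Step 1: look up the keyword; MAX() yields NULL if it has never been searched
  SELECT MAX(keyword_id) INTO v_keyword_id
  FROM Keywords
  WHERE keyword = p_keyword;

  IF v_keyword_id IS NULL THEN
    -- Steps 2 and 3: insert the new keyword and capture its generated key
    INSERT INTO Keywords (keyword) VALUES (p_keyword);
    SET v_keyword_id = LAST_INSERT_ID();

    -- Step 4: populate the intersection table with every matching bug
    INSERT INTO BugsKeywords (keyword_id, bug_id)
      SELECT v_keyword_id, bug_id
      FROM Bugs
      WHERE description LIKE CONCAT('%', p_keyword, '%');
  END IF;

  -- Step 5: return the full rows for all bugs matching this keyword
  SELECT b.*
  FROM Bugs b
  JOIN BugsKeywords bk USING (bug_id)
  WHERE bk.keyword_id = v_keyword_id;
END//

DELIMITER ;
```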
Now we can call this stored procedure and pass the desired keyword. The procedure returns the set of matching bugs, whether it has to calculate the matching bugs and populate the intersection table for a new keyword or whether it simply benefits from the results of an earlier search.
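For example, using the sketch above:

```sql
CALL BugsSearch('crash');
```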
We can do more with this solution: we can define a trigger to populate the intersection table as each new bug is inserted. If we need to support edits to bug descriptions, we may also have to write a trigger to reanalyze the text and add or delete rows in the `BugsKeywords` table.
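A sketch of such an insert trigger, again assuming the tables above and matching on the `description` column only:

```sql
CREATE TRIGGER Bugs_Insert
AFTER INSERT ON Bugs
FOR EACH ROW
  -- Link the new bug to every known keyword its description contains
  INSERT INTO BugsKeywords (keyword_id, bug_id)
    SELECT keyword_id, NEW.bug_id
    FROM Keywords
    WHERE NEW.description LIKE CONCAT('%', keyword, '%');
```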
The keyword list is populated naturally as users perform searches, so we don’t need to fill the keyword list with every word found in the knowledge base articles. On the other hand, if we can anticipate likely keywords, we can easily run a search for them, thus bearing the initial cost of searching for each keyword so that it doesn’t fall on our users.
Returning to the challenge I described at the beginning of this chapter, I used an inverted index for my knowledge base application as the solution. I also enhanced the `Keywords` table with an additional column, `num_searches`, which I incremented each time a user searched for a given keyword so I could track which searches were most in demand.